arXiv:1508.04556vl [stat.ML] 19 Aug 2015 


Spatio-temporal spike and slab priors for MMV problems 

Michael Riis Andersen, Ole Winther & Lars Kai Hansen 
DTU Compute, Technical University of Denmark 
DK-2800 Kgs. Lyngby, Denmark 
Email: {miri, olwi, lkh}@dtu.dk 


Abstract —We are interested in solving the multiple measurement vec¬ 
tor (MMV) problem for instances, where the underlying sparsity pattern 
exhibit spatio-temporal structure motivated by the electroencephalogram 
(EEG) source localization problem. We propose a probabilistic model that 
takes this structure into account by generalizing the structured spike and 
slab prior and the associated Expectation Propagation inference scheme. 
Based on numerical experiments, we demonstrate the viability of the 
model and the approximate inference scheme. 

I. Introduction 

The multiple measurement vector problem (MMV) m is given by: 

Y = AX + E, (1) 

where A e is the forward matrix, Y C is the 

measurement matrix, X = [xi X2 ... xt] C is the 

desired solution and E C R^^^ is a matrix of corruptive noise. 
We are interested in finding sparse solutions to eq. G) in the 
ill-posed regime, where N < D. Furthermore, the sparsity pattern 
of X is assumed to have certain structural properties. In particular, 
we are considering problems where the sparsity pattern exhibit 
spatio-temporal structure as in EEG source localization G), 0] or 
in background subtraction in computer vision n. Let zt be an 
indicator for the support of xt, i.e. zt = I[xt / 0], then zt is 
assumed to be spatially correlated. Furthermore, we assume that the 
support vectors zi,Z 2 ,...,zt slowly evolve through time as well - 
rendering the joint sparsity assumption invalid ID. 

The main contribution of this work is to propose a model for 
spatio-temporal sparsity patterns by extending the structured spike 
and slab prior m to account for temporal evolution of the sparsity 
pattern as well. Furthermore, we demonstrate the benefits of the 
model through numerical experiments. 

A. Related work 

The field of structured sparsity has received a great deal of attention 
in the recent years. In this section we highlight some of the related 
work, but this list is by no means exhaustive. The LASSO-community 
have introduced the Group and Graph LASSO methods, which 
generalize the standard ^i-norm minimization approach to promote 
different kinds of structured sparsity Q. In the probabilistic setting, 
the standard workhorse for sparsity is the so-called spike and slab 
prior im. This has also been generalized to model group sparsity 
im and cluster sparsity |[T0l. In the context of compressed sensing 
mi , Cevher et al. [i4| used a Markov random field to enforce spatially 
correlated sparsity patterns, whereas Ziniel et al. used binary Markov 
chains to model temporally correlated sparsity patterns 113 . 

II. The structured spike and slab prior 

In this section we briefiy introduce the conventional spike and slab 
prior im and the structured spike and slab prior before we move 
on to the spatio-temporal spike and slab prior on the next section. The 
conventional spike and slab prior decomposes each Xi^t as a product 


of a binary variable zi^t and a real number Ci,t, i.e. Xi^t — 
where Zi^t ~ Ber (po) and Ci^t ~ M (0, tq) for i G {1, 2,.., D} and 
t G {1,2,..,T}. The structured spike and slab prior generalized this 
formulation by imposing structure on the binary variable for each 
time t as follows 

D 

p(zt|0(7t)) = jqBer (zi,t 10(7i,t)) , (2) 

i=l 

p(7t) = A/'(7t|M*,St) , (3) 

where the Bernoulli probabilities are parametrized using the standard 
normal CDF 0 : R ^ (0,1). The hyperparameters fxt and Et 
encode the prior belief of the support for time t. Specifically, the 
prior mean value fit controls the prior belief of the number of 
non-zero variables and the covariance matrix Et determines the 
prior correlation of the support at time t. Thus, we can impose 
structure on the binary support variables zt by means of imposing 
generic covariance functions on 7. For example, say we choose 
to be the squared exponential covariance function, then the resulting 
prior distribution will promote sparsity patterns where neighbouring 
support variables have the same state. Under the other hand, when 
E is diagonal, we recover the independent spike and slab prior. 


The marginal prior probability of the Xi^t being non-zero is 
given by 


p{zi^t = 1) = y* p{zi^t = 


l^i,t 




(4) 


Thus, if the prior on 7^ has zero mean, then the prior belief of p{zi^t) 
is unbiased, i.e. p{zi^t) = 0.5. On the other hand, if pa^t is negative, 
the prior belief of zi^t is biased towards zero and vice versa. 


III. The spatio-temporal spike and slab prior 

In this section we describe the temporal extension of the struc¬ 
tured spike and slab prior. Instead of considering fit and Et as 
fixed hyperparameters, we propose to impose a prior on F = 
[71 72 ... 7 t ] to model problems where the support of the 

solution X changes over time. In particular, we impose a first order 
process Markov process on F to model the slowly changing sparsity 
pattern 

P (7*|7t-i) = .^( 7*1 (1 - a) Mo +a7t-i,/3So) , (5) 

where the hyperparameters a and /3 control the temporal correlation 
and the ’’innovation” of the process, respectively. Furthermore, we 
assume that the prior distribution on 71 is given by 

p( 7 i) = ^(711^0,20) . 


( 6 ) 




Under these assumptions the marginal distribution of 72 becomes 

^(72) = J P (72171) p( 7 i)d 7 i 

= A/" (72IM0, (a^ +/ 3 ) So) . ( 7 ) 

Therefore, it follows by induction that if a and /3 satisfy + /3 = 1 , 
then the marginal distribution of jt is pijt) = A/^(mo 5 ^0) for all 
t. Furthermore, we also see that for a = 1 and /3 = 0 , the prior 
reduces to the structured spike and slab prior in the joint sparsity 
setting. In the other extreme, at a = 0 and /3 = 1 , the prior reduces 
to the structured spike and slab prior in the time-independent setting. 
Hence, the spatio-temporal spike and slab prior can be seen as a 
generalization of the two extreme cases. 

This choice of model is also motivated by the fact that the 
first order structure in the temporal dimension gives rise to an 
inference scheme that scales linearly in the number of time steps T 
as we will see in the next section. 

IV. Bayesian Inference using the spatio-temporal spike 

AND SLAB PRIOR 

The goal of this section is to describe an inference procedure for 
solving the problem in eq. Q using the proposed prior in a fully 
Bayesian setting. We combine the spatio-temporal spike and slab 
prior with a time-independent isotropic Gaussian noise model of the 
form 

T 

p(Y|X) = P[A/'(y*|Axt,agl). (8) 

t=l 

This gives rise to the following joint distribution 

T 

p(Y, X, Z, r) =P[ V (yt I Ax,, (To'l) 

t = l 

" -V-' 

/l(X) 

T D 

nn 

t = l i=l 

^ v- 

/2(X,Z) 

T D 

nii®®^(^'’‘i'^Li,t)) 

t=l i=l 

" -V-" 

/3(z,r) 

T 

A/'(7i|/Lto, X;o)]^A/'(7t|(l-<T)/Lto + 

t=2 


The desired posterior distribution p(X, Z,r|Y) is obtained from 
Bayes’ Rule ca. Unfortunately, this posterior distribution is in¬ 
tractable due to the product of mixtures and hence, we have to 
settle for approximate inference. Specifically, we use Expectation 
Propagation IT 4 l-il 6 l for approximate inference by extending the 
proposed inference scheme in ii. 

A. Approximate Inference using Expectation Propagation 

Expectation propagation (EP) is an iterative deterministic method 
for approximating probability distributions using simpler distributions 


from the exponential family. As indicated in eq. the exact 
posterior can be decomposed as follows 

T T D 

p(X, z, r|Y) =- jq (x,) H n {xi,uZi,t) 

t=l t=l i=l 

T D T 

nn f 3 ,i,t ^i,t) / 4 ,t (7^) ? ( 10 ) 

t = l i=l t=l 

where Z = p(Y) is the normalization constant. Moreover, note that 
each factor in the decomposition only depends on a subset of the 
variables in the model, i.e. / 2 ,z,t depends only on the variables Xi^t 
and Zi^t and so on and so forth. The EP framework takes advantage 
of this decomposition by approximating each factor in eq. Go) with 
a distribution from the exponential family. First we describe the 
functional form of the approximation and then we briefly explain 
how to estimate the parameters of the approximation using the EP 
algorithm. 

Let /i,t denote the approximation of /i,t etc. First, we note 
that each of the factors in the first term, i.e. /i,t for all t, are already 
a member of the exponential family and hence does not have to be 
approximated. Therefore, for each t we have 

fi,t (xt) = ff , ( 11 ) 

where the parameters are determined by and 

’ *^0 

Vf} = ^ A^A. Note that the exact term fi^t is a distribution on yt 

’ ^0 

conditioned on xt, whereas the approximate term /i,t is a function 
of xt that depends on yt through mi,t and etc. Next, we turn to 
the factors in the second term, i.e. Since each of these factors 

depends on Xi^t and Zi^t, we choose / 2 ,z,t to be 

= A/” Ber {zi^t\f {l 2 ,i,t)) , (12) 

where rfi2,i,t, V2,i,t and 72,z,t have to determined using the EP 
algorithm. Based on similar arguments f3,i,t and /4,t are chosen 
as follows 

[Zi^t\(l>{j 3 ,i,t)) Af (^J 3 ,i,t\fl 3 ,i,t,^ 3 ,i,t^ , ( 13 ) 

A, =V( 7 t|A 4 ,t,S 4 ,t) . ( 14 ) 

Note that /4,i does not have to approximated either, it is simply 
/4,i = A/^ (71 |/xo, Xlo). Furthermore, note that the approximations 
to the factors /4,t for all t do not factorize w.r.t. 7t,i, 7^,2,... in 
order to capture potentially strong correlations in the support. 

After specifying all the individual approximation terms, we derive 
the joint approximation of the desired posterior p(X, Z, r| Y). Since 
the exponential family is closed under products, the approximate 
joint distribution has the following form 

T T D 

Q(X,Z,r) =]^A/'(^xt|mt, nn Ber (^z,t|0(7i,t)) 

t=l t=li=l 

T 

]qv( 7 t|At,S*) • ( 15 ) 

t=l 

Let m2,t = [^2,17,^2,27,... ,m2,L»,t]^ and V27 = 

diag (V2,i,t, V2,2,t, ..., V2,D,t^, and analogously for /is, X3 






and 73 , then the parameters of the joint approximation are given by 


Vt 

= (Tm 


(16) 

lilt 

= Vt (vyiiii.t + vyrh 2 ,t), 

(17) 

St 

= (S3. 

t + S 4 ,t) 

(18) 

P-t 

= st(: 

^ 3 ,tA 3 ,t + S 4 _J/i 4 ,t) , 

(19) 


= </>■' 

/ (1 - 0(72,i.t)) (1 - 0(73,i.t)) 1 

. ( 20 ) 


The posterior covariance matrices Vt and Xlt are (potentially) fully 
dense matrices, which makes the approximation able to cope with 
non-orthogonal forward matrices A. 


B. The Expectation Propagation Algorithm 

In this section we describe how to compute the parameters of the 
individual approximations using the EP algorithm. The EP algorithm 
works by updating each of the individual approximation terms one 
by one until convergence. Consider the update of the term fa,i,t 
for a given a, i and t. The update is obtained by performing the 
following three steps of the EP algorithm. The first step is to remove 
the contribution of fa,i,t from the joint approximation in eq. Oby 
forming the so-called cavity distribution 

Q\a,i,t ^ ^ 21 ) 

fa,i,t 

In the next step we minimize the Kullbach-Leibler (SI divergence be¬ 
tween and w.r.t. That is, we minimize 

KL (^ Zai t where Za,i,t is the normalization 

constant of For distributions within the exponential 

family, minimizing this form of KL divergence amounts to matching 
moments between and |[14). Finally, the third 

and last step is to compute the new update of fa,i,t as follows 


fa,i,t OC 


Q 


a,t ,new 


Q\a,z,i 


( 22 ) 


After the individual approximation terms fa,i,t for all i and t 
for a given a have been updated, the relavant parts of the joint 
approximation are updated using eq. OB-®. To minimize the 
computational load, we use parallel updates of / 2 ,z,t 121 followed 
by parallel updates of / 3 ,z,t rather than the conventional sequential 
update scheme. Furthermore, due to the fact that /2 and fs 
factorizes w.r.t. both i and t, we only need the marginals of the 
cavity distributions which simplifies the computations. 

Computing the cavity distributions and matching the moments are 
straightforward. However, when matching the moments, we are 
required to evaluate of the zero’th, first and second order moment 
of the distributions of the form (ji\p.i,'Eii). Derivation 

of analytical expressions for these moments can be found in the 
appendix to chapter 3 in CD. 


• Initialize approximation terms fa for a = 1, 2, 3,4 and Q 

• Repeat until stopping criteria 

- For each f2,i,t- 

* Compute cavity distribution: oc 

J2,i,t 

* Minimize: KL(/2.i,tQ\^’'’*| |Q^’‘’“'*) w.r.t. 

* Compute: f2,i,t oc Q\2,i,t to update parameters 

'^2,i,tiV2,i,t and 

- Update joint approximation parameters: m, V and 7 

- For each/3,z,t: 

* Compute cavity distribution: oc 

J3,i,t 

* Minimize: w.r.t. 

~ ^3,f,new 

* Compute: oc Q\3,i,t to update parameters 

^rid 73 ,z,t 

- Update joint approximation parameters: jl, ^ and 7 

- For each /4,t 

* Compute cavity distribution: oc 

J 4,f 

* Minimize: KL(/4,tQ\‘‘’*| w.r.t. 

~ Q4,t,new 

* Compute: 74,t oc to update parameters 

m4,t, V 4 ,t and 747. 

- Update joint approximation parameters: /i, T, 

Fig. 1. Proposed algorithm for approximating the joint posterior distribution 
over X, Z and F conditioned on Y. 


C. Tuning of hyperparameters 

The algorithm requires tuning of multiple hyperparameters for opti¬ 
mal performance. The Expectation Propagation framework provides 
a neat alternative to typical cross-validation schemes. Besides the 
approximation to the posterior distribution P(X,Z,r|Y), EP also 
provides an approximation to the marginal likelihood P(Y), which is 
very useful for model selection and tuning of hyperparameters ca. 
The exact marginal likelihood is obtained by marginalizing out X, Z 
and r from the joint distribution in eq. (|9). The EP approximation 
to the marginal likelihood is obtained by substituting all the (scaled) 
individual approximation terms into the resulting formula. Finally, it 
is also possible to get closed form expression for the gradients of the 
marginal likelihood approximation w.r.t. to the hyperparameters on, 
( 13 , which allows efficient tuning of the hyperparameters. However, 
a detailed treatment of the marginal likelihood approximation and 
its gradient w.r.t. hyperparameters are out of scope for this extended 
abstract. 


V. Numerical Experiments 

In order to investigate the properties of the proposed algorithm, 
we have designed and conducted two numerical experiments. The 
first experiment addresses the reconstruction performance of the 
algorithm, whereas the second experiment investigate the algorithm’s 
robustness towards coherent forward models. 


The proposed EP algorithm is summarized in figure \T\ The 
computational complexity of the algorithm is dominated by the 
matrix inversions in eq. and fT^ . However, when A < D, the 
covariance matrices have low rank and hence, eq. (QB can be 
carried out in O using the Matrix Inversion Lemma lITSl . 

Therefore, the resulting inference scheme scales as O (TD^^, i.e. it 
scales linearly in the number of measurement vectors T. 


A. Experiment 1 

To evaluate the proposed method, we have compared the method 
to several related solvers: BG- AMiQ dD, DCS-AMlfl (20), Spatial 

^We used the implementation in GAMP-toolbox by Sundeep Rangan et al: 
http://gampmatlab.wikia.com/wiki/ 

^We used the implementation in the DCS-AMP-toolbox by Justin Ziniel: 
http ://w w w2. ece. ohio- state. edu/ ~ zinielj/dcs/ 












EP (implements the structured spike and slab prior) (h) and Spatial 
MMV ER The BG-AMP method combines the conventional spike 
and slab prior with approximate message passing-based ED 
inference. We include this method to have a baseline result without 
any structural assumptions on the sparsity pattern. The DCS-AMP 
can be seen as an extension of BG-AMP, which assumes that the 
sparsity pattern evolves slowly in time according to a binary Markov 
chain. The Spatial EP method assumes spatial correlation in the 
sparsity pattern, but no temporal correlation. Einally, the Spatial 
MMV method is similar to Spatial EP but with static sparsity across 
time, i.e. it assume joint sparsity across time. 

To set up the first test we first sampled one realization of Z 
using eq. ©-([S]) with D = 100, T = 100, a = 0.99 and 
/3 = 1 — see figure |4(a)[ The average number of non-zero 
weights per column is fixed to 20. We note that the resulting 
sample exhibits the spatio-temporal structure as desired. Afterwards, 
we sample the nonzero coefficients in X from a standard normal 
distribution and from these we generate compressive measurements 
using eq. Q, where Ai, ~ A/'(0,1/iV), the SNR = lOdB and the 
undersampling ratio N/D is varied from from 0.05 to 0.95. To 
quantify the performance of the methods we use Normalized Mean 
Square Error (NMSE) between the true X and the estimated X 
given by 

NMSE = - ^ y, -(23) 

Eurthermore, we evaluate each method’s ability to recover the true 
support Z using the E-measure ll22l based on a MAP estimate of the 
support Z, 

p ^ o precision ■ recall ^^ 4 ) 

precision + recall 

The results are averaged over 100 realizations of the noise E and 
non-zero coefficients in X and are shown in figures l2ll^ It is seen 
that the proposed spatio-temporal method outperforms the other 
methods both in terms of NMSE and E-measure, but in general 
it is seen that richer prior assumptions on the support improves 
the results significantly. We also note that for very undersampled 
problems, the Spatial MMV EP method with static sparsity actually 
performs best. But as the undersampling ratio increases, all the other 
methods, including BG-AMP, outperforms it due to the very high 
bias of the model. 

Eigures |4(b)||4(f)| shows the reconstructed support sets for the 
undersampling ratio N/D = 0.4. It is seen that DCS-AMP and 
Spatial EP, which models temporal and spatial structure, respectively, 
clearly outperforms BG-AMP Eurthermore, it is also seen that joint 
sparsity assumption (fig. |4(e)| ) is too restrictive for these kinds of 
signals. Again, we note that the spatio-temporal model gives superior 
results in terms of both E-measure and NMSE. 

B. Experiment 2 

The forward model A in the EEG source localization problem 
contains highly correlated columns, i.e. A is coherent. Therefore, it 
is of interest to investigate the proposed algorithm’s robustness to 
coherent forward models. The set-up in this experiment is basically 
the same as for the first experiment, except that undersampling ratio 
is now fixed to N/D — 0.4 and the elements in the forward model 
Aij are no longer Gaussian i.i.d. Instead we sample the rows of 
A from a correlated multivariate normal distribution, such that the 



Fig. 2. Normalized mean square error as a function of undersampling ratio. 
The data are g enerated from Y = AX + E with the sparsity pattern shown 
in figure where D = 100, T = 100 and SNR = lOdB. The entries in 
A are Gaussian i.i.d, i.e. Ai^j ~ N (0,1/Y). The results are averaged over 
100 realizations. 



Fig. 3. F-measure error as a function of undersampling ratio. The dat a are 
generated from Y = AX + E with the sparsity pattern shown in figure |4(ajl 
where D = 100, T = 100 and SNR = l^dB. The entries in A are Gaussian 
i.i.d, i.e. Ai^j ~ Af (0,1/Y). The results are averaged over 100 realizations. 


columns of A will be correlated. In particular, the correlation of 
the z’th and j’th column of A is given by We compute the 

NMSE and E-measure as a function of the correlation r. Note that 
the BG-AMP and DCS-AMP methods are designed for Gaussian i.i.d 
forward. These two methods are therefore not expected to perform 
well in this experiment, but we include them for completeness. The 
results are averaged over 50 realizations and are shown in figures [5] 
and[ 6 l The EP-based methods show some robustness to correlation 
in the columns of A, but the performance does degrade gradually 
when we increase the correlation. In particular, when changing the 
correlation r from 0.05 to 0.95, the E-measure for the spatio-temporal 
method only drops from approximate 0.92 to 0.89, but the NMSE 
increases from approximately 0.15 to 0.45. 
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Fig. 4. True and reconstructed support for the 5 considered methods. The undersampling ratio is N/D = 0.4 and D = 100, T = 100 and SNR = lOdB. 
a) True support, b) BG-AMP (NMSE = 0.805, F = 0.450), c) DCS-AMP (NMSE = 0.777, E = 0.763), d) Spatial EP (NMSE = 0.658, F = 0.902), e) Spatial 
MMV EP (NMSE = 0.833, E = 0.663), f) Spatio-temporal EP (NMSE = 0.618, E = 0.935). 
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Fig. 5. NMSE error as a function of undersampling ratio. The dat a are 
generated from Y = AX + E with the sparsity pattern shown in hgure |4(a)| 
The correlation of the Tth and j’th column of A is given by The 

results are averaged over 50 realizations. 
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Eig. 6. F-measure error as a function of undersampling ratio. The dat a are 
generated from Y = AX + E with the sparsity pattern shown in hgure [4^ 
The correlation of the Tth and j’th column of A is given by The 

results are averaged over 50 realizations. 

































VI. Conclusion & outlook 

We extended the structured spike and slab prior and the associated 
Expectation Propagation inference scheme to cope with smooth 
temporal evolution of the sparsity pattern. Based on numerical 
experiments with synthetic data we demonstrated the benefits of the 
extended model. In particular, we showed that the method outper¬ 
formed the reference methods. Future work includes developing an 
automated approach learning the hyperparameters of the prior and 
applying the proposed method to a real EEG source localization 
problem. 
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